Abstract. In recent years, Distributed Visualization over Personal Computer (PC) clusters has become important for research and industrial communities. PC clusters have made large-scale visualizations practical and more accessible. In this work we survey Distributed Visualization techniques, compiling the last decade's literature on the use of PC clusters as suitable alternatives to high-end workstations. We review the topic by defining basic concepts, enumerating system requirements and implementation challenges, and presenting up-to-date methodologies. Our work serves newcomers as an introductory compilation and helps experienced professionals organize ideas.
1 Introduction

Visualizing huge datasets demands processing power that surpasses that of commonplace workstations. To cope with this need, high-end systems using Symmetric Multi-Processor (SMP) technology have been commercialized. Alternatively, during the last decade, a great number of works have discussed the design and implementation of visualization over computer clusters. These works constitute the Distributed Visualization discipline and range from complete systems to prototypes and theories to better utilize distributed computation.

According to the Top 500 Supercomputer list, currently more than 40 percent of the fastest computers in the world are clusters of networked computers. Among these clusters is the Quake project at the Pittsburgh Supercomputing Center, which is composed of hundreds of proprietary systems in a cluster of workstations reaching top generation performance.


Concomitant to this process, the PC commodity industry has evolved furiously in recent years and tends to continue at this pace, see figure 1. The capabilities of storage devices, processing power, memory speed, graphical buses and frame buffers have doubled roughly every two years or less. These resources lead to a graphical power that was formerly infeasible. Today, a commodity U$1500 workstation has graphics capabilities that exceed those of a late 1990s U$500K supercomputer. These advances, together with network improvements, made it possible to build PC clusters that rival high-end machines.

An up-to-date benchmark from the National Aeronautics and Space Administration (NASA) demonstrates not only that commodity computer clusters rival SGI workstations, but also that the maintenance cost of these workstations is sufficient to build a new PC cluster every year. Thus, in this work, we compile the progress in Distributed Visualization, focusing on clusters of PCs. We build a condensed document aiming at organizing ideas and discussing unsolved issues.

In this text we review the complexities to consider when designing and implementing Distributed Visualization over clusters of PCs. At the same time, we present the advantages of this platform, including low costs for supercomputing, the ability to track technologies, systems that can be incrementally upgraded, open source software and vendor autonomy.

Figure 1: Various aspects of commodity hardware evolution and tendency for a single processing node equipped with a top generation Graphics Processing Unit. Except for the network speed data, all the data were extracted from Fernando's seminar.

The organization of the text is as follows. Section 2 introduces the Distributed Visualization topic and section 3 presents a set of issues related to implementing clusters of PCs. Section 4 presents libraries that intend to abstract the assembling of such systems and the final section concludes the paper.

2 The Visualization Pipeline 
[Figure 2 diagram: Model/Collect phenomena -> raw data -> (Stage 1) Select, Filter, Normalize -> object-space primitives -> (Stage 2) Mapping -> screen-space primitives -> (Stage 3) Rasterize -> pixels (image). Stages 2 and 3 are enclosed by a dashed rectangle labeled "rendering pipeline".]

Figure 2: Three stages named filter, mapper and rasterizer compose the complete visualization pipeline. The first of them may be omitted, executed before or along the visualization process. In such case, only the mapper and the rasterizer stages compose the process. These two stages determine the core of the visualization pipeline, which is called rendering pipeline (dashed rectangle).


Visualization is organized according to the model cast by two works, Upson et al [48] and Haber and McNabb [25]. Their model, presented in figure 2, describes three stages to achieve a visualization. In a pipeline structure, each stage executes a distinct processing step whose output feeds the next stage.

After an initial data collecting stage, the pipeline proceeds with the raw data volume, which is processed according to the purposes of the intended visualization. In this stage, named filter (preprocessing or traversal), the dataset is selected, filtered, cleaned, enriched, summarized, normalized, and/or submitted to any processing useful to optimize the rendering process, e.g., culling operations. Then, as illustrated in figure 3, the prepared data (object-space primitives) are submitted to a mapping procedure (geometry transformation) that determines how the data will be displayed in the form of geometrical entities (screen-space primitives), e.g., tessellation data, an isosurface or a contour map. The last stage, named rasterizer (or renderer), applies operations like projection, lighting and/or shading to the screen-space primitives in order to finally generate the image (pixels). The whole process is known as the visualization pipeline, while the last two stages are known as the rendering pipeline. We refer to these concepts throughout the text.

[Figure 3 diagram: object-space primitives (data volumes that may be represented as a cube and as a sphere) and screen-space primitives (tessellation data).]

Figure 3: Illustration of data that may take part in a visualization pipeline. Two objects, a cube and a sphere, their corresponding volumes and tessellation data, followed by rendering.

In the Distributed Visualization domain, the stages of the visualization pipeline are not restricted to a single machine or location. Each step may take place locally or remotely, at the client that will analyze the data or at the server that owns the processing power. Accordingly, we define Distributed Visualization as the use of distributed resources to drive visual analysis. It is expected that such systems present the possibility of separating data and exploration sites, the possibility of combining autonomous processing resources and the possibility of collaborative work.

Advances in commodity hardware have led to the use of PC clusters as an attractive option for Distributed Visualization. To explore these prospects, algorithms have been proposed to deal with memory and communication constraints in distributed environments. This approach extends the limits of a single PC, but it has to tackle load balancing and inter-process communication, among other problems, which we discuss in this text.
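The three-stage pipeline described in this section can be illustrated with a minimal sketch. The data types, the threshold-based filter and the point-only mapping below are hypothetical simplifications chosen for illustration; they are not taken from Upson et al or Haber and McNabb.

```python
from typing import Iterable, List, Tuple

# Hypothetical data types, for illustration only.
Sample = Tuple[float, float, float]  # raw data: (x, y, value), x and y in [0, 1)
Point2D = Tuple[int, int]            # screen-space primitive: a pixel position


def filter_stage(raw: Iterable[Sample], threshold: float) -> List[Sample]:
    """Stage 1 (filter): keep only the samples relevant to the visualization."""
    return [s for s in raw if s[2] >= threshold]


def mapper_stage(samples: List[Sample], width: int, height: int) -> List[Point2D]:
    """Stage 2 (mapper): turn object-space samples into screen-space positions."""
    return [(int(x * width), int(y * height)) for x, y, _ in samples]


def rasterizer_stage(points: List[Point2D], width: int, height: int) -> List[List[int]]:
    """Stage 3 (rasterizer): write the screen-space primitives into a pixel grid."""
    image = [[0] * width for _ in range(height)]
    for px, py in points:
        if 0 <= px < width and 0 <= py < height:
            image[py][px] = 1
    return image


def visualization_pipeline(raw, threshold, width, height):
    """Each stage feeds the next; stages 2 and 3 form the rendering pipeline."""
    primitives = filter_stage(raw, threshold)
    screen = mapper_stage(primitives, width, height)
    return rasterizer_stage(screen, width, height)
```

In a distributed setting, each of these functions could run on a different machine, which is the situation the remainder of the text addresses.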


3 Design and Implementation Issues for Distributed Parallel Rendering Systems

Distributed Visualization systems have been oriented to PC cluster (distributed memory) architectures, which calls for parallel rendering in distributed systems. Motivations include: the cost of commodity PC hardware is far lower than that of high-end visualization systems; PCs are all-purpose machines and can also be used for non-graphical applications; it is possible to benefit from the standards of the PC market, which allows for continuous upgrades with reduced effort; the open-source movement provides high quality software at low cost; and it is possible to add more PCs to the system in order to meet increasing power demands.

The main consideration for Distributed Visualization over PC clusters is the algorithm that binds the distributed resources into the visualization pipeline defined in section 2. Such algorithms fall into one of three categories: sort-first, sort-middle and sort-last (subsection 3.1), whose design goals concern load balancing and communication constraints among the nodes of the cluster (subsection 3.2). Supporting this structure are network technology (subsection 3.3), techniques for data management (subsection 3.4) and techniques for parallel storage (subsection 3.5). Another concern is the operating system over which the distributed visualization ensemble will execute (subsection 3.6). Alternatively, distributed parallel rendering libraries offer high-level implementation, as reviewed in the next section.


3.1 Algorithms for parallel tiled Distributed Visualization

In reference to the foundational visualization pipeline described in section 2, Molnar et al [36] formulate the most accepted classification for distributed parallel rendering algorithms. Their analysis determines how the rendering pipeline (geometric transformation followed by rasterization) maps onto a general parallel algorithm. According to the theory, parallelism is achieved by assigning transformation tasks (on object-space primitives) and rasterization tasks (on image-space primitives) to the distributed processing units.

The idea is that the complex modeling of visualization scenes, geometrical entities (object-space primitives and screen-space primitives) and rendered pixels can be distributed in different ways across the processing resources. The design task is thus concerned with how to assign the data portions of the pipeline so that the intended image can be achieved at the end. At the same time, processing load and network communication constraints must be satisfied. Sutherland et al consider it a sorting (or classification) problem. This sorting can occur at one of three moments, as presented in figure 4. After the sorting is complete, the data need to be redistributed among the nodes to reflect the newly sorted arrangement.

Figure 4: Algorithms for distributed rendering rely on the decision of where/when the parallel tasks will be designated for the processing units. This is considered a sorting/classification problem. There are three possibilities for this sorting operation, each of which deeply characterizes the corresponding algorithms.

The classes defined when considering the rendering pipeline and the sorting problem are named sort-first, sort-middle and sort-last. They differ in terms of bandwidth requirements, amount of duplicated work and load balance.


Sort-first (image (or pixel) space parallelism)

In sort-first, the screen is divided into disjoint regions (tiles) that are assigned to the processing nodes, as illustrated in figure 5. To do so, the object primitives are submitted to a minimal geometry transformation (screen-space bounding box calculation) necessary to determine which tiles of the screen they overlap. Then, the object primitives are transmitted to the processing nodes (renderers) that correspond to those tiles, where both transformation and rasterization are performed. Finally, the rendered tiles are reassembled to form the final scene.

Figure 5: Classical sort-first configuration. Based on tiled partitioning information, the data is distributed among the processing nodes. After transformation and rasterization, a final step assembles the tiles to form the image.

The initial geometry transformation phase of sort-first is called pre-transformation. Due to this extra processing, sort-first is the most computationally expensive design for distributed visualization, but also the one that consumes the least bandwidth. Examples include the work by Zhu et al [50] and the work by Mueller [39]. The latter presents a study of sort-first implementation and its advantages, mainly frame-to-frame coherence (high for interaction sequences) and lower bandwidth demand.
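The pre-transformation step of sort-first can be sketched as follows. This is a minimal illustration, assuming a regular grid of equally sized tiles and primitives already reduced to integer screen-space bounding boxes; the function names and the row-major tile numbering are hypothetical and do not come from the cited works.

```python
from typing import Dict, List, Tuple

# Screen-space bounding box: (xmin, ymin, xmax, ymax), inclusive pixel coordinates.
BBox = Tuple[int, int, int, int]


def tiles_overlapped(bbox: BBox, tile_w: int, tile_h: int,
                     cols: int, rows: int) -> List[int]:
    """Return the row-major indices of the tiles that a bounding box overlaps."""
    xmin, ymin, xmax, ymax = bbox
    first_col = max(0, xmin // tile_w)
    last_col = min(cols - 1, xmax // tile_w)
    first_row = max(0, ymin // tile_h)
    last_row = min(rows - 1, ymax // tile_h)
    return [r * cols + c
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def sort_first(bboxes: Dict[str, BBox], tile_w: int, tile_h: int,
               cols: int, rows: int) -> Dict[int, List[str]]:
    """Route each primitive to every renderer whose tile its bounding box overlaps."""
    per_tile: Dict[int, List[str]] = {t: [] for t in range(cols * rows)}
    for name, bbox in bboxes.items():
        for tile in tiles_overlapped(bbox, tile_w, tile_h, cols, rows):
            per_tile[tile].append(name)
    return per_tile
```

A primitive that straddles a tile boundary is sent to more than one renderer, which is the source of the duplicated work associated with sort-first.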

Sort-middle 

Sort-middle is the natural approach for distributed parallel rendering because transformation and rasterization are performed at different levels of the cluster, see figure 6. Initially, the algorithm distributes object-space primitives among the nodes according to some load balance method, e.g., round robin. Then, after geometry processing, the resultant screen-space primitives are distributed to the rasterization nodes. Similar to sort-first, the algorithm assigns tiles of the screen to specific processing nodes but, differently, there is no pre-transformation step.

The disadvantages occur when the tessellation ratio is high. Tessellation refers to the decomposition of larger primitives into smaller ones. It implies that the system must redistribute several display primitives instead of just one object primitive. For sort-middle, high tessellation ratios imply higher communication costs.
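The two points above, round robin distribution of object-space primitives and the communication growth caused by tessellation, can be made concrete with a small sketch. The function names are hypothetical, and the cost model simply assumes that every object primitive yields the same number of display primitives.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def round_robin(primitives: Sequence[T], num_nodes: int) -> List[List[T]]:
    """Distribute object-space primitives to transformation nodes in turn."""
    nodes: List[List[T]] = [[] for _ in range(num_nodes)]
    for index, primitive in enumerate(primitives):
        nodes[index % num_nodes].append(primitive)
    return nodes


def primitives_to_redistribute(num_object_primitives: int,
                               tessellation_ratio: int) -> int:
    """Display primitives that must cross the network after geometry processing,
    assuming a uniform tessellation ratio (display primitives per object primitive)."""
    return num_object_primitives * tessellation_ratio
```

Under this assumption, 1000 object primitives with a tessellation ratio of 8 lead to 8000 display primitives being redistributed in the middle of the pipeline, instead of 1000 primitives being sent once.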

Another disadvantage of sort-middle is the load imbalance on the rasterization units if the primitives are unevenly distributed over the screen, which may also occur in sort-first algorithms. Also, according to Mueller [39], the loose coupling in the middle of the pipeline can limit feedback from the rasterization stage back to the transformation stage, which makes certain visibility culling algorithms either less efficient or infeasible. Montrym et al present a custom-designed implementation of this parallelism and Ellsworth makes an extensive review of sort-middle systems.

Figure 6: Sort-middle configuration with two levels of processing nodes. The screen-space data communication may also occur between nodes at the same level, determining a one-level-only structure, where transformation nodes also act as rasterization nodes.

Sort-last (object-space parallelism) 

In this case, the sorting occurs after the end of the rendering pipeline, that is, when the pixels are ready to compose the image, as presented in figure 7. In a load-balanced manner, the processing nodes (renderers) receive arbitrary subsets of the object-space primitives. After transformation and rasterization, the resultant pixels are submitted to a composition procedure. At this final step, sort-last will have produced a set of full-screen images (sort-last-full) or a set of screen-space primitives (sort-last-sparse). These images are recomposed by hardware or software that computes every sample at each pixel to define the primitives' visibility. This is called (depth) sorting and, according to Foley et al [20], it relies on the use of Z-buffering. Thus, the processing nodes must send, along with the pixels, the corresponding Z-buffers. This need greatly increases the required bandwidth, to the order of gigabytes per second.
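The depth-based composition of sort-last-full and its bandwidth cost can be sketched as below. This is a minimal illustration, assuming each node returns a full-screen image whose pixels carry a depth value (smaller is closer) and a color; the 4-byte color and 4-byte depth sizes are assumptions made only for the arithmetic.

```python
from typing import List, Tuple

Pixel = Tuple[float, int]   # (depth, color); smaller depth is closer to the viewer
Image = List[List[Pixel]]   # full-screen image produced by one node


def composite(images: List[Image]) -> Image:
    """Z-buffer composition: at each pixel keep the sample closest to the viewer."""
    height, width = len(images[0]), len(images[0][0])
    result: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            row.append(min((img[y][x] for img in images), key=lambda p: p[0]))
        result.append(row)
    return result


def frame_traffic_bytes(width: int, height: int, nodes: int,
                        color_bytes: int = 4, depth_bytes: int = 4) -> int:
    """Bytes sent to the compositor per frame when every node ships color plus depth."""
    return width * height * (color_bytes + depth_bytes) * nodes
```

With these assumed sizes, 8 nodes rendering 1280 x 1024 images send 83,886,080 bytes per frame; at a hypothetical 30 frames per second this amounts to roughly 2.5 gigabytes per second, which matches the order of magnitude stated above.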

The advantages of sort-last are the better control of load balance with respect to the object-space primitives and the simplicity of the approach, because the processing nodes perform the full pipeline independently. Implementation examples include the works by Lombeyda et al and by Morel et al [38].



Figure 7: Sort-last-full configuration. The initial distribution of tasks considers only load-balancing directives. In a second step, each processing node is responsible for the complete rendering of a full screen image with only a subset of the objects that compose the visualization. At the final step, each screen is superimposed following depth information in order to have the correct visibility of the visual entities.

In the literature, each author advocates for her/his choice considering the most suitable features of sort-first, sort-middle or sort-last. More recent works point to sort-first and sort-last for PC cluster implementations. Sort-middle is used with high-end shared-memory systems such as SGI's hardware, probably because fast memory buses are less influenced by the overheads of sort-middle. For comparison, table 1 presents an overview of the three possibilities.

Hybrid approaches are also possible, as done by Samanta et al [43] and Garcia and Shen. The former work tries to minimize the sort-last composition overhead by means of a dynamic sort-first partition. Their approach benefits from a view-dependent partitioning of both the 3D model and the 2D screen. The latter work leverages the advantages of both sort-first and sort-last approaches with a hybrid sorting that partitions both image and data for load balance.

3.2 Load balancing 

Load balance applies distinctly for image parallelism (sort-first, sort-middle) and object parallelism (sort-last). In object parallelism, object distribution rules define how to reach nearly equal loads among the processing nodes. In image parallelism, load balance is based on screen partitioning methods.


Ellsworth points out that, for object parallelism, random or round robin approaches are used to distribute objects among the processing nodes. These techniques work fairly well for objects with homogeneous size and complexity. For objects with great differences, the time to process them may vary by a large factor. Further possibilities for load balancing consider the geometry as hierarchical structures or as sets of primitives (flat structures); this topic is reviewed by Ellsworth et al.

For image parallelism, if unequal portions of the image are assigned to the processing nodes, some of them will remain idle while waiting for others to finish their tasks. This problem is treated by screen dividing methods, as exemplified in figure 8, which can be static or dynamic.



frame 1 frame 2 


Figure 8: Screen-partitioning with 9 tiles. By means 
of frame coherence, only the primitive highlighted in 
tile 7 must be sent in order to draw tile 8 in frame 2, 
which was formerly empty. The other primitives remain 
in memory via retained mode operation. 

Static screen dividing methods divide the screen into more regions than the number of processing nodes and assign the regions in an interlaced fashion to these nodes. The number of regions per node is called the granularity ratio. For low granularity, the workload may not be balanced. For high granularity, we may have a high overlap factor (the average number of regions overlapped by the primitives), which leads to network overload in image parallelism. If a primitive lies over three regions, the entire primitive must be transmitted three times for transformation and rasterization because, even if just a small piece overlaps a tile, its computation depends on the entire primitive. According to Molnar et al [36], if we assume equal sized primitives and equal probability for the positioning of these primitives on the screen, the overlap factor is given by:

$$\mathrm{Overlap} = \frac{R_{width} + P_{width}}{R_{width}} \times \frac{R_{height} + P_{height}}{R_{height}} \qquad (1)$$

where $R_{width}$, $R_{height}$, $P_{width}$ and $P_{height}$ are, respectively, the width and the height of a given screen region and the width and the height of a given primitive bounding box.
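The trade-off between granularity and overlap can be explored numerically. The sketch below assumes that the overlap factor adds the primitive dimensions to the region dimensions, as in the formulation of Molnar et al, and uses a simple modulo rule as a stand-in for the interlaced assignment; both function names are hypothetical.

```python
from typing import List


def interlaced_assignment(num_regions: int, num_nodes: int) -> List[int]:
    """Static interlaced assignment: region i is handled by node i mod num_nodes."""
    return [region % num_nodes for region in range(num_regions)]


def overlap_factor(r_width: float, r_height: float,
                   p_width: float, p_height: float) -> float:
    """Average number of regions overlapped by a primitive bounding box, assuming
    equal sized primitives uniformly positioned on the screen."""
    return ((r_width + p_width) / r_width) * ((r_height + p_height) / r_height)
```

For a 10 x 10 pixel bounding box, regions of 100 x 100 pixels give an overlap factor of about 1.21, whereas regions of 25 x 25 pixels give about 1.96. Finer granularity therefore improves balance at the price of sending each primitive to almost two regions on average.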
Table 1: Overview of the main algorithm possibilities for Distributed Visualization.

Feature            Sort-first              Sort-middle           Sort-last
Parallelism        image                   object / image        object
Sorted data        object space primit.    screen space primit.  pixels (z-buffer)
Bandwidth demand   low                     medium                high
Overhead factor    pre-transform./overlap  high overlap          image composition
Frame coherence    yes                     no                    no


Dynamic (adaptive) screen dividing methods work by computing statistics from the on-screen distribution of primitives in order to intelligently determine and assign the tiles. These algorithms add overhead to the visualization process, first due to the gathering of statistics and the decision making they implement, and second due to the more elaborate screen division. Two common approaches for adaptive screen division are: first, to statically partition the screen and then dynamically allocate the regions to the processing nodes; second, to settle a constant number and assignment of regions and then dynamically vary their shape, as done by Roble. Also, dynamic partitioning combined with dynamic assignment is possible, as proposed by Mueller [39]. Finally, a comparison of algorithms for space division is available in the work by Kurc et al [30].
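The first adaptive approach, a static partition with dynamic allocation of regions, can be sketched with a generic greedy heuristic. This is not the method of any of the cited works; it is a common longest-processing-time rule, and the per-tile cost (for instance, the number of primitives counted in the previous frame) is a hypothetical statistic.

```python
import heapq
from typing import Dict, List


def assign_tiles(tile_costs: Dict[int, int], num_nodes: int) -> Dict[int, int]:
    """Greedy allocation: give the most expensive remaining tile to the node
    that currently has the smallest accumulated load."""
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment: Dict[int, int] = {}
    for tile, cost in sorted(tile_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)
        assignment[tile] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment


def node_loads(tile_costs: Dict[int, int], assignment: Dict[int, int],
               num_nodes: int) -> List[int]:
    """Total estimated cost assigned to each node."""
    loads = [0] * num_nodes
    for tile, node in assignment.items():
        loads[node] += tile_costs[tile]
    return loads
```

The statistics gathering and the sort performed at every frame are examples of the overhead that dynamic methods add to the visualization process.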

3.3 Network issues 

The great disadvantage of PC clusters compared to shared bus (SMP) systems is that data sharing does not occur over a high speed direct access memory bus, but over a much slower network. Thus, network factors demand special attention in order to reach effective bandwidth and network latency performance. Bandwidth corresponds to the amount of data that can be transmitted per unit of time. Network latency is the time to prepare and transmit the data between two nodes.

Bandwidth constraints are affected by the network 
speed, the data-network adapter speed, the bus interface 
and the memory. That is, the bandwidth is a function of 
the data traveling time, receiving time, in-node transfer 
time and memory storage time. 

Meanwhile, the network latency varies with the network interface that sends/receives the data, the bus interface used to read/transfer the data within the node, the memory architecture used to access/store the data and the processing power available to decide and perform the whole process. High network latency barely affects communication made of few long messages, e.g. Internet browsing, but it is decisive for communication characterized by plentiful short messages, as required by computer clusters.
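The different weight of latency for short and long messages follows from a simple cost model, sketched below. The model (fixed latency plus size divided by bandwidth) is a common first-order approximation rather than something taken from the cited works, and the figures of 100 microseconds and 125 MB/s are illustrative upper bounds for Gigabit Ethernet.

```python
def transfer_time_us(message_bytes: float, latency_us: float,
                     bandwidth_mb_s: float) -> float:
    """First-order transfer time in microseconds: latency plus size / bandwidth.
    With 1 MB = 10**6 bytes, a bandwidth of 1 MB/s equals 1 byte per microsecond."""
    return latency_us + message_bytes / bandwidth_mb_s


def latency_share(message_bytes: float, latency_us: float,
                  bandwidth_mb_s: float) -> float:
    """Fraction of the transfer time that is spent on latency."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_mb_s)
```

Under these assumptions, a 1000-byte message takes 108 microseconds, of which about 93 percent is latency, while a 100-megabyte message takes about 0.8 seconds, of which latency is a negligible fraction.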


These factors must be designed to maximize the bandwidth while minimizing the latency. Together with high quality hardware and system architecture, an appropriate network must be set up. A suitable practice is to isolate the cluster in a network in which traffic is limited to the cluster communication. This setting constitutes a System Area Network (SAN). In such systems, the hubs must have minimal retransmission latency and the interconnection of different networks must be avoided due to higher latency. For optimization, a switch, instead of a hub, can serve the network so that intelligent directional ports permit parallel communication. Next, we review major factors to consider in order to come up with a suitable network structure for distributed parallel rendering.

The Message-Passing Interface (MPI) standard 

The MPI standard is a message-passing library specification under development since 1992. A message-passing library is a high-level abstraction that permits intercommunication within a collection of autonomous processes, each with its own local memory. It eases the implementation of shared memory and distributed shared memory systems, which are the foundation of computer cluster parallelism. There are several message-passing libraries, but the MPI standard is the de facto convention for clusters.

MPI ranges from supercomputing to PC clusters. It is a standard and not a product; implementations of it are available for several operating systems and network technologies. Liu et al present a performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics technologies (detailed further in this section). Gropp et al describe MPICH, a portable implementation of MPI ("CH" stands for "Chameleon"). MPICH is the most used free distribution of MPI and its design goal is to combine performance with high portability within a single implementation. Another popular package is ROMIO, an implementation of the MPI-IO part of the standard, with broad portability and free on-line support.
VIA technology

In stack-based communication protocols, such as TCP/IP, a great part of the latency is caused by in-node operations. When a process in a cluster node wants to transmit, it prepares the data and makes a high-level call to some network API that accesses the network hardware. Then, when the respective interrupt request (IRQ) is issued, the OS copies the data to another memory space used as a buffer. This buffer, finally, is read by the network hardware, which transmits it via the physical layer. This process consumes a large amount of time per data transmission, slowing communication by up to orders of magnitude.

To lessen this problem, the Virtual Interface Architecture (VIA) was conceived. VIA, created by Intel, Compaq and Microsoft, describes an alternative interface between network hardware and user software. This interface provides direct access to the network hardware (user-level communication protocol), which lowers the transmission latency by avoiding OS intermediation (zero-copy protocol). VIA is accomplished by hardware integration on the Network Interface Card (native implementation) or by software emulation. The former achieves the best performance; the latter consumes extra processing, but even so its latency performance surpasses that of regular network usage, according to an IEEE report.

Cameron and Regnier present complete information about VIA. Baker et al describe a study on VIA performance gains over Gigabit Ethernet. Information is also available about MVICH (MPICH for Virtual Interface Architecture), a popular implementation of MPI on top of VIA technology, and about Modular VIA (M-VIA), a high performance implementation of VIA for Linux systems. The use of VIA is a straightforward method to diminish the effects of network latency, which is the main limitation in computer clusters.


Network technologies

Transmitting 3D objects over the network, as in sort-middle algorithms, can compromise the available bandwidth. Thus, research is performed to devise geometry compression algorithms, as done by Touma. These algorithms reduce bandwidth requirements to as low as 10 bits per vertex, including connectivity. However, the required bandwidth and network latency can still exceed the capacity of commodity Ethernet networks. To cope with that, high-performance interconnection technologies are used. We list them in the next paragraph.

Myrinet is a packet-communication and switching technology used to interconnect clusters of workstations; it offers up to 2 Gigabit/second full duplex links and it is based on the ANSI (American National Standards Institute) Standard ANSI/VITA 26-1998. The Quadrics technology reaches up to 8.5 Gigabit/second full duplex rates on top hardware systems provided with the QsNet II network. InfiniBand is steered by an association of member companies involved in performance computing, data center and storage implementations. It offers up to 30 Gigabit/second channels. The Scalable Coherent Interface (SCI) network, which is an ANSI/ISO/IEEE Standard (1596-1992), has been specifically designed for computer clusters. With reduced network latency, SCI behaves like a bus or a network using point-to-point links to achieve higher speed. It implements a cache scheme as a coherent virtual shared memory. Its Dolphin release reaches up to 2.6 Gigabit/second rates.

In table 2 we present an overview of these technologies together with the Gigabit Ethernet commodity technology. The data are only for rough comparison because a number of other factors may influence performance and costs.

All these technologies have support for VIA (native or emulated) and for implementations of MPI. The choice among them depends on other factors such as compatibility with the cluster equipment and operating system, performance and price. Latency is decisive for massive short message communication; thus, the poor latency of Ethernet makes it the worst choice. Quadrics and InfiniBand present superior bandwidth coupled with very attractive latency times. The drawback is the elevated price of these options. More adequate alternatives are Myrinet and SCI networks. Myrinet has already been widely used for clustering, while SCI presents the best latency performance. A wider description of these technologies is available elsewhere, and Yeo et al [49] present a benchmark-oriented study of the topic.


3.4 Data management 

Distributed Visualization deals with terabyte-order datasets over heterogeneous platforms. The storage and utilization of this information have specific implications, especially the required physical space, the I/O tasks to be performed in suitable time and the data format expected by the applications. Therefore, three efforts have emerged as leading initiatives to determine standards in scientific large volume data management: HDF, CDF [6] and netCDF.
Table 2: Network technologies overview.

Network technology    Bandwidth (MB/s)   Latency (µs)   Avg. Price/Port (U$) [40]
Gigabit Ethernet      < 125              < 100          ~ 300.00
10 Gigabit Ethernet   < 1250             < 60           ~ 7000.00
Myrinet               < 250              < 10           ~ 400.00
Quadrics QsNet II     < 1064             < 3            ~ 2000.00
Infiniband            < 3750             < 7            ~ 800.00
SCI                   < 326              1-2            ~ 800.00


HDF, CDF and netCDF provide a platform-independent library via a high-level API. The stored data can be of any dimensionality and of several forms (numerical, string, images), and can be randomly read or written, singly or in blocks. HDF employs a more flexible data model (hierarchical) than netCDF and CDF (multidimensional array) and, according to Li et al, this flexibility comes at the cost of higher processing loads. The three formats are referenced about equally in the literature and present equivalent features and performance. The choice among them should consider compatibility with the target system in hardware, software and development language.

3.5 Parallel File Systems 

In Distributed Visualization, it is possible to have tens of machines simultaneously accessing the same terabyte dataset. A single disk device cannot cope with these needs. Parallel file systems were designed to deal with this problem. These systems are designed on a client-server basis with multiple servers running a sort of I/O daemon. The parallel file system stripes the files and stores them across these servers. To retrieve information, the system reassembles a desired file and transmits it to the client. The whole process occurs automatically via calls to a user-level library. Other functionalities, like permission checking for file creation, open, close and removal, are supported by an auxiliary manager process that handles metadata during system operation.
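File striping and reassembly can be illustrated with an in-memory sketch. It assumes round robin placement of fixed-size stripes, which is a common scheme but is not claimed to be the exact layout used by Lustre or PVFS; the servers are modeled as plain lists and all names are hypothetical.

```python
from typing import List


def stripe(data: bytes, num_servers: int, stripe_size: int) -> List[List[bytes]]:
    """Split a file into fixed-size stripes and place them round robin on servers."""
    servers: List[List[bytes]] = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), stripe_size):
        server = (offset // stripe_size) % num_servers
        servers[server].append(data[offset:offset + stripe_size])
    return servers


def reassemble(servers: List[List[bytes]]) -> bytes:
    """Rebuild the file by reading one stripe from each server in turn."""
    chunks: List[bytes] = []
    rounds = max(len(stored) for stored in servers)
    for round_index in range(rounds):
        for stored in servers:
            if round_index < len(stored):
                chunks.append(stored[round_index])
    return b"".join(chunks)
```

Because consecutive stripes live on different servers, several clients (or one client reading a large range) can be served by several disks at once, which is the point of the design.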

Parallel file systems have to be robust and scalable, conform to existing I/O APIs for backward compatibility, maintain file addressing semantics, provide transaction support and be easy to use and install. Among the most popular implementations of parallel file systems for commodity PCs are the Lustre system [28], from Cluster File Systems Inc., released as open-source software, and the Parallel Virtual File System (PVFS), also open-source, both for the Linux platform. The latter is in its second release, which presents a number of improved features and a new design.


Real parallel file systems are very complex, which may
explain why only a few implementations are available for
PC clusters. Comparing the options is not simple, as
their complexity gives them a great number of features
that are difficult to benchmark. Margo et al. perform an
extensive analysis of PVFS, Lustre and GPFS; however, no
categorical conclusions are drawn, and it is up to the
analyst to decide which one to choose. With the release
of Lustre as open-source and the emergence of PVFS
version 2, these systems are likely to keep evolving and
to receive new features regularly.

3.6 Operating System 

A report from Silicon Graphics observes that the
operating system (OS) is replicated on each machine of a
PC cluster, which increases costs with each new node.
The license expenses and the memory and processing
requirements of each operating system instance add up to
a considerable burden. According to Yeo et al. [49],
besides these factors, the choice of operating system
for a cluster must consider: manageability, the
management and administration of local and remote
resources; stability, robustness against system failures
together with system recovery; performance, optimized
efficiency of OS tasks; extensibility, the ability to
easily integrate cluster-specific extensions;
scalability, scaling without performance degradation;
support, user and system administrator support; and
heterogeneity, support for multiple architectures so
that a cluster can consist of heterogeneous hardware.

Another consideration is the configurability of the OS,
which enables varied configurations and customized
optimizations. Choices in the market include proprietary
Unix solutions, expensive but easy-to-set-up Windows
systems, and low-cost, flexible (open-source) Linux
systems. According to the worldwide Top 500
supercomputers list, as reported by Forbes magazine, the
Linux platform has beaten its competitors as the main
choice for supercomputing.
3.7 Technologies Summarization


To link much of the information provided so far, in
figure 9 we present the I/O structure of a cluster of
computers with optimized data access. At the top layer is
the parallel execution, which is responsible for the data
processing according to one of the parallelisms described
in section 3.1. These algorithms are load-balanced
according to the discussion carried out in section 3.2.


[Figure 9 diagram. Layers, from top to bottom: Image;
Parallel Execution; MPI-VIA I/O; Data Access Level; Data
Management Library; MPI-VIA I/O; Parallel File System;
Storage Hardware.]


Figure 9: The layers of a Distributed Visualization
system. Storage devices are at the lowest layer,
abstracted by parallel file system access. The MPI
standard, along with VIA technology, provides easy
multi-point communication for data management libraries,
such as HDF and netCDF, that feed the application with
the data to be processed. At the highest layer, the
parallel execution is performed to produce visualization
images.

This model illustrates the state-of-the-art
implementation strategy for Distributed Visualization of
large datasets. The methodologies and technologies to be
used at each layer depend on several factors, as
discussed throughout the text. Current works in the
literature deal with finding better settings for this
model and/or simplifying it with additional abstraction
layers. This last issue is introduced in the next
section in the form of distributed parallel rendering
libraries.

The layers in figure 9 are assisted by optimized network
technologies, as presented in section 3.3. Among these
technologies, the MPI standard, combined with VIA
technology, is present at each layer of the model,
providing simplified and optimized access to remote
data. To promote efficient data access, it is necessary
to use data management libraries, like those presented
in section 3.4. These libraries access the information
at the lowest layer, where the storage hardware provides
access to voluminous data. To abstract the storage
hardware, parallel file systems, like the ones described
in section 3.5, provide transparent, parallel,
high-performance access.

4 Distributed parallel rendering libraries

Attempts have been made to abstract Distributed
Visualization through graphical libraries. The goal is
to allow ordinary graphical library calls while
simplifying the management of distributed processing
units organized as a visualization cluster. Earlier
works met this goal but were not based on commodity
hardware. Later proposals address commodity PCs.

WireGL 

The WireGL library replaces the OpenGL driver to enable
OpenGL in Distributed Visualization environments. By
preserving the OpenGL API, applications can run on top
of WireGL without recompilation and obtain performance
speedups, according to Humphreys et al. WireGL supports
one or multiple clients simultaneously sending commands
and data to one or multiple servers in sort-first
parallelism. It intercepts regular OpenGL commands and
sends them to servers over the network. It also
implements a network protocol for geometry communication
and performs final image reassembly in software or
hardware.
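
The sort-first distribution mentioned above can be sketched in a few lines. This is an illustration of the general technique and not WireGL's implementation or protocol: it assumes a screen whose dimensions divide evenly into a grid of tiles, one rendering server per tile, and triangles already projected to 2D screen coordinates. All function names are hypothetical.

```python
def make_tiles(width, height, cols, rows):
    """Split a width x height screen into cols x rows tiles as (x0, y0, x1, y1).

    Assumes width and height are divisible by cols and rows respectively.
    """
    tile_w, tile_h = width // cols, height // rows
    return [(c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            for r in range(rows) for c in range(cols)]


def bounding_box(triangle):
    """Screen-space bounding box of a triangle given as three (x, y) points."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    return min(xs), min(ys), max(xs), max(ys)


def overlaps(box, tile):
    """True when the closed bounding box touches the half-open tile."""
    bx0, by0, bx1, by1 = box
    tx0, ty0, tx1, ty1 = tile
    return bx0 < tx1 and bx1 >= tx0 and by0 < ty1 and by1 >= ty0


def sort_first(triangles, tiles):
    """Map each server (tile index) to the indices of the triangles it must render."""
    assignment = {index: [] for index in range(len(tiles))}
    for t_index, triangle in enumerate(triangles):
        box = bounding_box(triangle)
        for s_index, tile in enumerate(tiles):
            if overlaps(box, tile):
                assignment[s_index].append(t_index)
    return assignment
```

A triangle that straddles a tile boundary is sent to every server whose tile it touches, which is the redundancy cost typical of sort-first schemes. The final image is obtained by placing the rendered tiles side by side.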

Chromium 

Chromium is an advanced derivation of WireGL that
similarly overlays OpenGL for compatibility. It supports
the use of stream processing units, or SPUs, that
perform specific rendering tasks. The SPUs can be
chained to achieve a complex rendering execution.
Chromium’s architecture stands out for its generality
and flexibility. The SPU chain can be configured
arbitrarily, and both sort-first and sort-last
parallelisms have been achieved, according to Humphreys
et al. [27]. The drawback of Chromium’s architecture is
that its performance is limited by its stream
orientation, which cannot efficiently exploit frame
coherence.
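
The idea of chaining processing units over a command stream can be illustrated with a minimal sketch. It is not Chromium's API: the classes Stage, DropDebug, Scale and Recorder, the chain helper and the (name, arguments) command format are all hypothetical and stand in for real SPUs and OpenGL commands.

```python
class Stage:
    """One processing unit; it handles a command and may pass it downstream."""

    def __init__(self):
        self.next_stage = None

    def handle(self, command):
        self.forward(command)

    def forward(self, command):
        if self.next_stage is not None:
            self.next_stage.handle(command)


class DropDebug(Stage):
    """Filter stage: discards commands whose name starts with 'debug'."""

    def handle(self, command):
        if not command[0].startswith("debug"):
            self.forward(command)


class Scale(Stage):
    """Transform stage: scales the coordinates of 'vertex' commands."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def handle(self, command):
        name, args = command
        if name == "vertex":
            args = tuple(value * self.factor for value in args)
        self.forward((name, args))


class Recorder(Stage):
    """Terminal stage: stands in for a renderer by recording what it receives."""

    def __init__(self):
        super().__init__()
        self.received = []

    def handle(self, command):
        self.received.append(command)


def chain(*stages):
    """Link the stages in order and return the head of the chain."""
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.next_stage = downstream
    return stages[0]
```

A chain built as chain(DropDebug(), Scale(2.0), recorder) passes every command through the filter and the transform before it reaches the recorder. Because each stage sees one command at a time, no stage has a view of the whole frame, which hints at why a stream-oriented design has difficulty exploiting frame coherence.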

OpenSG 

OpenSG [46] is a multi-threaded scene graph API,
specially designed for Virtual Reality applications.
The scene graph metaphor (or hierarchical graphics
database) organizes a graphical model as a graph that
can manage visual entities hierarchically. By
maintaining a copy of the scene graph for each thread,
the threading system copes with distributed rendering
because various servers can simultaneously respond to
interaction (graph changes). To do so, OpenSG uses a
client-server setup to replicate data on multiple
machines that receive broadcasts informing of graph
changes every frame. The library is flexible and can be
used for implementing sort-first, sort-middle and
sort-last algorithms. A similar library, also scene
graph oriented, is the OpenRM project.

Figure 10: Example of a scene graph defined with the
Generalized Scene Graph API. The hierarchical structure
of the graph eases the handling of complex scenes via
operations propagated along the paths of the graph.
Reproduced with permission granted by Jürgen Döllner and
Klaus Hinrichs.
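
The replication of a scene graph through per-frame change lists can be sketched as follows. This is not OpenSG's API: the Node class, the two-dimensional translation stored in each node and the helpers build_scene and apply_changes are hypothetical, chosen only to show how hierarchical propagation and change broadcasting fit together.

```python
class Node:
    """A scene graph node with a local 2D translation and child nodes."""

    def __init__(self, name, offset=(0.0, 0.0)):
        self.name = name
        self.offset = offset
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def find(self, name):
        """Depth-first search for the node with the given name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def world_positions(self, origin=(0.0, 0.0)):
        """Propagate translations down the paths of the graph."""
        here = (origin[0] + self.offset[0], origin[1] + self.offset[1])
        positions = {self.name: here}
        for child in self.children:
            positions.update(child.world_positions(here))
        return positions


def build_scene():
    """Every machine starts from an identical copy of the scene."""
    root = Node("root")
    car = root.add(Node("car", (10.0, 0.0)))
    car.add(Node("wheel", (1.0, -1.0)))
    return root


def apply_changes(graph, changes):
    """Apply one frame's change list of (node name, new offset) to a replica."""
    for name, offset in changes:
        graph.find(name).offset = offset
```

In this sketch the client and each server hold their own copy from build_scene, and the client sends only the change list each frame rather than the whole graph. Moving the "car" node moves its "wheel" child as well, because positions are derived by propagation along the graph.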

The general orientation and high-level style of these
libraries cause them to be less scalable than specific
optimized implementations. This limitation is clearly
demonstrated by Gribble et al., who ported their Simian
project both to a customized cluster implementation and
to the Chromium framework for performance comparison.
Other issues are flexibility and compatibility.

5 Conclusions 

We have surveyed basic concepts of Distributed
Visualization. The presented content aims at elucidating
what a Distributed Visualization system is, how it is
characterized and what issues are involved in its design
and implementation. We provide beginners with an
introductory direction for both research and development
and, for more experienced readers, an analytical view of
such systems. We have focused on distributed parallel
rendering architectures, a cluster-based approach that
has emerged in the literature through works exploring
flexible, low-cost commodity PCs. These implementations
have reached high performance levels and scalable
architectures that evolve at the pace of market
innovations.

Many challenges still have to be overcome in Distributed
Visualization. Despite the increasing performance of PC
clusters, their power is still far from that of
workstations such as Silicon Graphics’s
InfiniteReality4-enabled systems, which scale up to 20.6
Gpixels (textured and antialiased) filled per second, or
further. Robust real-time rendering of dynamic datasets
is also an open challenge. Off-the-shelf Distributed
Visualization software to support collaborative
analytical work has not been accomplished either. We
expect that this work can stimulate the quest for these
goals by providing a source of information about
Distributed Visualization expertise.